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METHOD OF DETERMINING THREE-DIMENSIONAL PROTEIN 
STRUCTURE FROM PRIMARY PROTEIN SEQUENCE 



This application claims the benefit of and incorporates by reference herein 
U.S. Provisional Application No. 60/044,124, filed April 22, 1997 entitled 
"Method of Determining Three-Dimensional Protein Structure From Primary 
Sequence" by inventors William A. Goddard III and Derek A. Debe. 

The U.S. Government has certain rights in this invention pursuant to Grant 
No. CHE-95-22179 and ACS-92-17368 awarded by the National Science 
Foundation. 



BACKGROUND 

The present invention generally relates to methods for determining a protein's 
three-dimensional structure from its amino acid sequence. More particularly, 
the present invention relates to methods for generating a sequence 
independent ensemble of folded topologies for a n residue protein which can 
then be used as a starting point in protein structure prediction. Both sequence 
dependent and sequence independent methods for reducing the number of 
potential conformations are also described. 

Since the seminal work by C.B. Anfinsen, determining the three-dimensional 
structure of a protein from its amino acid sequence has been a much sought 
after goal in structural and computational biology. However, although 
progress has been made in several fronts such as secondary structure 

1 
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prediction and homology modeling, a general method for ab initio structure 
prediction, or in other words, a solution to the so-called "protein folding 
problem", has eluded investigators. 

In general, two reasons are most often cited for the intractability of this 
problem. The first relates to the combinatorics involved when attempting to 
exhaustively enumerate all possible conformations even when simplifying 
assumptions are made. For example, assuming only three conformations per 
amino acid residue, a relatively small protein of a 100 residue would have 
approximately 10 48 (or 3 100 ) potential conformations. Since proteins i;i their 
native environments are known to fold in millisecond timescales, the apparent 
paradox (termed LevinthaPs paradox) on how a protein arrives at its "correct" 
three-dimensional structure without systematically sampling the inordinately 
large number of potential conformations has been the subject of intense 
debate. 

Notwithstanding the apparent paradox, various methods, such as molecular 
dynamics ("MD"), Monte Carlo, and Genetic Algorithm ("GA"), have been 
developed to sample the available conformational space. Since it was 
generally assumed that an exhaustive enumeration of the possible 
conformations was not possible, the goal of these methods have been to 
sufficiently sample the available conformation space so that a structure 
corresponding to the "native" protein may be found among the candidate 
structures. However, what constitutes sufficient sampling remains an open 
question. 

The second reason for the intractability of the folding problem relates to 
recognition of the "native" protein structure among the candidate set. 
Although numerous recognition methods are described in the literature, they 
are not generally suitable for use with low resolution structures like those 
generated by ab initio modeling procedures. 
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For example, existing methods are often based on contact potentials. These 
methods typically use some form of a reduced representation of the protein 
such as an a and p carbon sphere model. In this representation, the peptide 
backbone is approximated by an a-carbon sphere and the amino acid 
sidechains are approximated by the 20 unique p-carbon spheres. The energy 
of the protein, E, is then approximated by the sum of all of the individual 
contact energies such that, 



Values of the e terms are compiled according to the statistical relation: 



where ny is the number of contacts (interactions within a predetermined 
cutoff distance) between interacting centers / and j, and n ijexp is the number 
of these contacts expected in a random distribution. 

Because structures generated by ab initio modeling procedures tend to be low 
resolution structures, contact potential based methods do not adequately 
discriminate between structures that have the correct overall fold from those 
that do not. Since most of these methods were developed to recognize the 
native crystal structure from a group of grossly misfolded decoys generated 
by threading the native sequence over all segments of equal length of 
available in the Protein Data Bank, their poor performance on lower 
resolution structures is not surprising. 



contact 
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Various attempts have been made to modify the contact potential based 
methods for use with ab initio modeling procedures. However, because local 
interactions are still emphasized over global protein architectures, these 
methods still are generally unsuitable for use with structures generated from 
5 ab initio modeling procedures. 

Thus, an important feature of a recognition method for use in an ab initio 
context is the de-emphasis of local energies or contacts in favor of global 
protein architecture. The challenge presented is to accurately estimate the 
locally minimum energy accessible by any structure that is similar in 
10 conformation to the examined structure. 

Despite the overwhelming challenges, there is an increasing need for accurate 
protein structure prediction methodologies as the number of known primary 
sequences continues to far outpace the number of solved three dimensional 
structures. This need is exacerbated since analytical solution of proteins of 
interest by either NMR or X-ray crystallography are not always possible due 
to experimental difficulties such as protein insolubility or insufficient 
quantities of purified protein. Since determining the three-dimensional 
structure is often the key in elucidating the function and mechanism of the 
protein of interest (thereby allowing researchers, for example, to discover new 
drugs or more potent drugs), an accurate and easy to use method for ab initio 
structure prediction would be an invaluable tool 

SUMMARY OF THE INVENTION 

The present invention presents a novel approach to the protein folding 
problem by reframing the question in terms of folded topologies. In other 
25 words, instead of starting with the entirety of the available conformational 

space, the starting point for the inventive method is the subset of 
conformational space that represents distinct, self-avoiding, folded topologies. 

4 
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By reframing the question, the number of conformations that needs to be 
considered is dramatically reduced from 3 n to approximately 1.2 n . 

In addition, novel recognition procedures based only on the a-carbon position 
have also been developed. These recognition techniques analyze a protein 
structure to determine whether or not the hydrophobic residues are capable of 
accessing the protein center, and whether the charged residues are capable of 
accessing solvent exposed protein exterior. Because energy scores obtained 
by these methods are invariant over a larger CRMS range from the native 
structure, these methods are better suited for use with ab initio procedures. 

BRIEF DESCRIPTION OF THE DRAWINGS 

Figure 1 is a schematic of the inventive ab initio protein modeling protocol 
wherein the three dimensional protein structure is modeled based on its amino 
acid (or even gene sequence). 

Figure 2 is an illustration of certain variables used in one of the ensemble 
reduction embodiments. The residues are designated as hydrophobic and 
non-hydrophobic (or hydrophilic). For each hydrophobic residue, a vector is 
calculated from the protein center of mass to the Ca of the hydrophobic 
residue. For each hydrophilic residue, a cone is drawn with its vertex as the 
protein center of mass that encompasses the hydrophilic residue of interest. 

Figure 3 is the CRMS data between the ab initio generated structure (GP 
equivalent) and the 277 native proteins and protein domains considered. 

Figure 4 is a plot of the GP ensemble size required to sample each and every 
native folded topology as a function of polypeptide length n. 
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Figure 5 is a plot of the maximum time scale for protein folding versus the 
protein length based on a diffusion constant, D=3xl0" 6 cm 2 /sec (or one new 
topology approximately every 0.3 nanosecond for a 100 residue protein). 

Figure 6 is a hierarchical strategy for ab initio protein folding. The 
5 conformation search at each step is greatly reduced due to coarse grain 

eliminations of conformations at the previous levels. The GP method coupled 
with an appropriate recognition algorithm produces a manageable set of 
candidates which contains the native folded topology. The time scales shown 
are estimates for a single processor Silicon Graphics Inc. (SGI) workstation. 

Figure 7 is the comparisons of the GP structures superimposed on the 
corresponding native protein backbone, (a) 65-residue segment from the 
NMR determined structure of the proteolytic fragment from 
Bacteriorhodopsin (lbct): the GP structure has a CRMS fit of 5.78 A and the 
refined structure has a CRMS of 4.35 A; (b) 65 residue Porcine C5a (lc5a): 
the GP structure has a CRMS fit of 5.40 A and the refined structure has a 
CRMS of 3.91 A; (c) 80 residue fragment from acyl-coenzyme A binding 
protein (laca): the GP structure has a CRMS fit of 6.12 A and the refined 
structure has a CRMS of 4.97 A; and, (d) 80 residue segment from domain 
four of the N-terminal domain of 70 kD heatshock cognate protein (lhpm04): 
the GP structure has a CRMS fit of 6.14 A and the refined structure has a 
CRMS of 4.22 A. 

Figure 8 illustrates one implementation of a distance constrained method for 
generating protein structures. 

Figure 9 is a flow chart of one implementation of the enrichment/replication 
25 process. 
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DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS 

The present invention generally relates to methods for determining a protein's 
three-dimensional structure from its amino acid sequence. More particularly, 
the present invention relates to methods for generating a sequence 
5 independent ensemble of folded topologies for a n residue protein which can 

then be used as a starting point in a hierarchical approach to protein structure 
prediction. Novel recognition methods which are more suitable for use with 
ab initio structure prediction procedures are also described. 

The inventive methods may be implemented by being programmed into and 
executed on a computer and includes the three general steps. The first 
involves generating an ensemble of all possible tertiary folds for a n residue 
protein. The exhaustive enumeration is possible because the number of self- 
avoiding folded conformations is substantially smaller than all possible 
conformations for a n residue protein. Moreover, because the conformation 
set is only dependent on the number of residues, n, the initial conformation 
set is entirely sequence independent. 

Once the initial conformation set is created, the second step involves reducing 
the number of potential structures by considering sequence specific 
information. Methods which may be used include a novel recognition 
protocol based only on a-carbon positions. The remaining structures may 
then be further refined using any number of techniques known in the art 
including sophisticated energy calculation (including explicit solvent) on full 
atom representations of the protein. Figure 1 is a schematic illustration of the 
procedures involved in obtaining a three-dimensional structure from a 
25 protein's primary sequence. 

Each step, ensemble generation, ensemble reduction, and ensemble 
refinement, will be further discussed in turn. 
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Ensemble Generation 

Although it is not necessary for the implementation of the present invention, 
for computational expediency, a reduced representation of the protein is 
generally preferred. One example of a reduced representation is the main 
5 atoms of the peptide backbone (N, Ca, C, and O). Another example is an a- 

carbon backbone and pseudo sidechain representation where the different 
amino acid sidechains are represented by a vector from the a-carbon to a 
pseudo Cp position. However, because the ensemble generation is 
independent of the amino acid sequence of the protein, the use of a peptide 
10 backbone or an a-carbon only representation is most preferred. 

The two general assumptions made in the implementation of the inventive 
methods are that conformations with overlapping peptide chains are 
inordinately high in energy and that non-neighboring spatially adjacent 
residues are subject to attractive van der Waals interactions. Because all 
15 amino acids are preferably treated alike at this stage, the ensemble generated 

for n residues will be the set of initial candidate structures for any protein 
having n residues regardless of its amino acid sequence. 



Although an infinite number of \\> dihedral angle combinations are 
theoretically possible when building a peptide backbone, a finite set is 

20 typically used for computational expediency. In preferred embodiments, the 

finite set of allowed dihedral angles represent the most energetically favorable 
and most populated regions of the Ramachandran plot. An illustrative 
example of a finite set is the six (j> v|/ dihedral angles that are used in the most 
preferred embodiments: (-65, -42); (-123, 139); (-70, 138); (-87, -47); (77, 

25 22); and (107, -174). Regardless of the set of <> vj/ dihedral angles, bond 

lengths and bond angles are fixed to standard values and the peptide torsion 
angle co is fixed to 180° for all residues. 
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If an cc-carbon only representation is used, a finite set of dihedral angles (as 
defined by Ca|-Ca i+1 -Ca i+ 2-Caj + 3) may be similarly calculated by generating 
a Ramachandran-like plot for Ca dihedrals that are found in solved protein 
structures. In this version of the buildup procedure, the Ca-Ca distance is 
5 fixed at preferably about 3.8 A. 



As described previously, the initial set of ensemble structures represents an 
exhaustive enumeration of the possible tertiary folds for a n residue protein. 
To this end, any suitable sampling technique may be used. Because of the 
general preference for reduced representation, methods developed for general 

10 polymers may also be used with the inventive method with little or no 

modification. Several such techniques are described in greater detail in the 
Experimental Section. However, because of its superior sampling efficiency, 
the method referred to as the Continuous Configuration Boltzmann Biased 
Direct Monte Carlo Method as described by Sadanobu and Goddard in J. 

15 Chem. Phys. 106: 6722 (1997) and incorporated herein by reference in its 

entirety, is most preferred. 



For the purposes of clarity, the residue buildup procedure will be explained 
using a peptide backbone representation from this point forward. However, it 
is to be understood, that any representation may be used and that persons 
20 skilled in the art would readily be able to adapt the described methods for the 

particular representation. 



The first step in the residue by residue build up procedure involves selecting 
a residue position from the amino acid sequence of the protein. The choice 
of the initial residue is not critical and may be at any point along the 
25 sequence. In preferred embodiments, a residue in the middle of the sequence 

is chosen. The coordinates for the first residue may be fixed at any point in 
the available coordinate space using standard bond lengths and angles for N, 
Ca, C, and O and setting the peptide torsion angle, co, at 1 80°. 
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Once the first residue is chosen, residues are added one at a time until the 
entire protein is constructed. With the exception of the first residue, $ \\i 
angles must be assigned to each residue that is added to the growing 
fragment- Although any suitable method may be used, a Metropolis based 
5 method is generally preferred where the probability of selecting one of the 

one of the available pairs of dihedral angles is governed by the ratio of the 
Boltzmann energy for the individual dihedral pair over the sum of the 
Boltzmann energies for all of the available dihedral pairs. Assuming that the 
set of dihedral angles is limited to six as in the most preferred embodiments, 
10 then the probability is governed by the equation: 




Once the <j> y dihedral angles are determined, the residue is added to the 
existing fragment. The energy of the fragment with the added residue is then 
calculated. Although any suitable method for evaluating the energy of the 
growing fragment may be used, the use of pair wise interaction energies is 
15 generally preferred. Moreover, because the non-bonded interactions falls 

dramatically within a short distance, the use of a specified cut-off distance is 
also preferred. A suitable cut-off distance from the added residue is between 
about 8 A and about 10 A. 

Examples of methods for calculating nonbonded interaction energies include 
20 but are not limited to Lennard- Jones 12-6 potentials, Lennard- Jones 8-4 

potentials, Morse Potentials, and Exponential-6 potentials. However, the 
Lennard- Jones 12-6 potential is generally preferred. 
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Because the energy calculation is to prevent the occurrence of overlapping 
residues, for computational expediency, it is most preferred to perform the 
energy calculations based on ct-carbon positions only even if a more complete 
representation of the protein is used to build the model (i.e., peptide 
backbone). Consequently, in the most preferred embodiments, the nonbonded 
energy for adding residue /, E j5 is given by the sum of its pair-wise . 
interaction energies of the a-carbons of the peptide fragment using the 
Lennard- Jones 12-6 potential: 



-2 



wherein R is the distance between the a-carbons of each residue; i and j are 
10 non-adjacent neighbors in sequence; and E 0 and Rq are set to predetermined 

values. 



If the added residue results in overlapping peptide backbones (as determined 
by the unfavorable energy of the fragment), protein conformations containing 
the resulting fragment are no longer pursued. This residue adding procedure 
15 in then repeated until the entire protein is constructed. 

In preferred embodiments, Rq is set between about 5 and 6 angstroms for all 
residue types, and E 0 is set between about 0.1 and 0.2 kcal/mol. In the most 
preferred embodiment, Rq is set at about 5.5 angstroms for all residue types; 
E 0 is about 0.15 kcal/mol. However, the exact values of Rq and E Q are not 
20 critical to the method and may be set to other values. Moreover, sequence 

dependency may also be introduced into the energy calculation, if desired. 
For example, a different E 0 may be used for each of the twenty amino acid 
residues. In another variation, residues are designated as either hydrophobic 
and hydrophilic based on known methods and three different energies (Eq's) 
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are assigned. As with the sequence independent E 0 , the precise energy values 
for the sequence dependent E 0 's are not critical and may be determined by 
any number of heuristic methods known in the art. However, one example of 
suitable energies are: E 0 = 0.15 kcal/mol for an interaction between two 
5 hydrophilic (or polar) residues; E 0 = 0.20 kcal/mol for an interaction between 

a hydrophilic and a hydrophobic (non-polar) residue; and E 0 = 0.25 kcal/mol 
for an interaction between two hydrophobic residues. 

Although it is not necessary to the practice of the invention, the use of an 
enrichment technique is preferred. In general, after a residue is added to the 
10 fragment, then m copies of the fragment is governed by: 

int[(z/<^>y(2^<^-i>M 

wherein z x = exp(-E/kt) which corresponds to the Boltzmann factor for the 
particular fragment in which the /th residue has just been added (thus 
resulting in an / residue fragment wherein i > 1); <z > is the accumulated 
15 average Boltzmann factor for all the fragments in the ensemble having / 

number of residues; z H is the Boltzmann factor for the particular fragment 
without the /th residue; and <Zj_,> is the accumulated average Boltzmann 
factor for all fragments in the ensemble having Ul number of 
residues. Each copy is then grown independently. 

10 The enrichment method incorporates a novel memory saving algorithm which 

is further described in the Experimental Methods section. Briefly, after a 
complete protein chain is constructed, a chain generation counter, Fj = mj is 
calculated for each residue /. This protein chain is stored in memory. 
Starting at the end of the protein chain, the method backtracks through the 

15 residue addition steps in the opposite order in which the residues were added 

until a value of m k > 1 is found for some residue k. The portion of the 
protein from residue /=1 to i=k is replicated and a new offspring protein 
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chain is constructed by adding residues to the replicated fragment. When the 
new offspring protein chain is completed, enrichment factors for each residue 
in the offspring are calculated and the chain generation counter at residue k in 
the parent chain is reduced by 1. The method again backtracks through the 
residue addition steps in the parent chain until another residue whose 
enrichment factor is greater than one is found. This procedure continues until 
the chain generation counter is 0 or 1 for each residue in the parent. 

Ensemble Reduction 

Once the ensemble of initial candidate tertiary structures are generated for a n 
residue protein, the next step in the process is to reduce the number of 
candidate structures. The first class of recognition filters is sequence 
independent and generally relate to the observation that proteins tend to be 
compact structures. This characteristic may be quantified using a number of 
known methods including the radius of gyration and moments of inertia. 
However, because the values are more easily calculated, methods based on 
using the radius of gyration are generally preferred. 

A powerful sequence independent filter is to exclude those structures that do 
not have native-like radius of gyration values. In preferred embodiments, 
only those members of the ensemble with a radius of gyration between 

5 l *^mm ^ 5 2 -R min are selected wherein is determined by the 
following 



predetermined values. Preferred values for 5j are between about 0.9 and 1 
with about .95 being the most preferred. Similarly, preferred values for 5 2 
are between about 1.4 and about 1.5 with about 1.3 being the most preferred. 




n r is the number of amino acid residues in the protein, and 8j and 5 2 are 
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However, the exact values of b x and 6 2 are not critical and their values may 
be outside of this range. 

This relatively simple calculation is able to eliminate approximately 70% of 
the structures in the initial ensemble set when 5j and 5 2 are set to preferred 
5 values. Because native protein structures have radius of gyration values 

which are at least about 10 to 15% above this minimum threshold, the 
likelihood of the correct candidate structure being eliminated by this criterion 
is minimal. 

A second class of recognition filters is sequence dependent which are either 
10 based on distance constraints or heuristic observations of native protein 

structures. Illustrative examples of distance constraints include but are not 
limited to the spatial proximity necessary between non-adjacent residues in 
sequence to satisfy disulfide bond requirements, metal coordination site 
requirements, and NMR derived NOE constraints. When known distance 
1 5 constraints are applied, only those members that satisfy the relative spatial 

arrangement of two or more residues are selected for further consideration. 

Although these distance-based constraints are used in the ensemble reduction 
phase in preferred embodiments, they may also be used as part of the 
ensemble generation process. For example, if at least one of the necessary 

20 distance constraints are not met in the growing fragment, such as two 

cysteine residues which form a known disulfide bridge not being spatially 
adjacent (i.e., Cot's within about 9 A), then that cluster based upon the 
particular fragment would be terminated. However, it should be apparent that 
the end result is the same regardless of whether the distance based constraints 

25 are applied during the initial ensemble generation or whether the distance 

based constraints are applied after all possible tertiary folds for the n residue 
protein are generated. 
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An example of a heuristic based criteria is the observation that hydrophobic 
residues in the protein must be able to access the protein core, and charged 
residues in the protein must be able to access the protein surface. Because of 
the extensive calculations involved and the necessity of having more refined 
5 candidate structures that includes at least pseudo-sidechain positions, the 

tendency for amino acid residues to partition in this manner has not been 
generally adapted for routine use in ab initio modeling protocols. 

To remedy these drawbacks, the present invention describes two procedures 
based only on Ca positions. The implementation of each procedure requires 

10 designating amino acid residues into at least two categories, hydrophobic and 

hydrophilic (or non-hydrophobic). Any standard method for assigning 
hydrophobicities may be used including but are not limited to methods 
described by Kyte & Doolittle, Kauzmann, Nozaki & Tanford, Eisenberg, 
Chothia, and Huang & Levitt. In preferred embodiments, hydrophobic and 

15 hydrophilic residues are defined as described by Huang et al in J. Mol. Biol. 

252: 709-720 (1995) and J. Mol. Biol. 257: 716-722 (1996), both of which 
are incorporated by reference it their entirety herein. 

The first method measures the ability of a hydrophobic residue to access the 
core. In a preferred embodiment, the center of mass of the candidate protein 

20 structure is calculated from the Ca positions. If a residue is hydrophobic, 

then a vector from the center of mass to the residue is constructed. A 
hydrophobic residue is deemed incapable of accessing the protein center and 
thus receives an energy penalty if another residue within a cutoff value is 
present between the center of mass and the hydrophobic residue. Preferred 

25 values for the cutoff is between about 0.4 and 0.6 A, and more preferably 

about 0.5 A. Generally this penalty is expressed energetically by assigning a 
positive energy value, preferably about 4 kcal/mol, for any hydrophobic 
residue between 0.95 and 1.3 of the center of mass. is a 
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10 



15 



20 



function of the number of amino acid residues in the protein, n r , and is 
defined as 



In more preferred embodiments, in addition to a hydrophobic penalty, 
hydrophobic residues that are able to access the core are assessed a favorable 
energy value. Illustrative examples of such a scheme include but are not 
limited to: 

i) assigning a positive energy value, preferably about 4 kcal/mol, for 
any hydrophobic residue between 0.95 and L3 of the center of 
mass which is unable to access the center; 

ii) assigning a negative energy value, preferably about -15 kcal/mol, 
for any hydrophobic residue within 0.95 of the center of mass; and, 

iii) assigning a less negative energy value than in ii), preferably about 
-10 kcal/mol, for any hydrophobic residue between 0.95 and 1.3 R^ of 
the center of mass which is able to access the center. 

In an even more preferred embodiment, the ability of hydrophilic residues to 
access the surface is evaluated in addition to the ability of the hydrophobic 
residues to access the core. The hydrophobic residues are treated as 
described above. As for the hydrophilic residue, instead of a vector, a cone 
is constructed from the center of mass as its vertex, preferably with a cone 
angle of 5 degrees, that encompasses the hydrophilic residue. A hydrophilic 
residue is deemed incapable of accessing the surface if the extension of the 
cone beyond the hydrophilic residue contains another residue. An illustrative 
set of energy parameters are: 

i) assigning a positive energy value, preferably about 4 kcal/mol, for 
any hydrophobic residue between 0.95 R^ and L3 R^ of the center of 
mass that is unable to access the center; 
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ii) assigning a negative energy value, preferably about -15 kcal/mol, 
for any hydrophobic residue within 0.95 of the center of mass; 

iii) assigning a less negative energy value than in ii), preferably about 
-10 kcal/mol, for any hydrophobic residue between 0.95 and 1.3 of 

5 the center of mass that is able to access the center; and, 

iv) assigning a positive energy value, preferably about 2 kcal/mol, for 
any hydrophilic residue that is incapable of accessing the surface. 

In another variation of the first method, the center of mass of the protein and 
R min are calculated as described above but uses a simplified scoring system 

10 based upon the identities of particular residues. If a particular hydrophobic 

residue is within a predetermined distance from the center of mass, then it 
receives a score of -1 . However, if the residue is outside the predetermined 
distance, then it receives a penalty score of +2. The hydrophobic distance 
cutoffs are: 1.2 for phenylalanine and isoleucine; 1.25 R min for leucine 

15 and valine; and 1.3 R^ for cysteine. In contrast, if a particular hydrophilic 

residue is within a predetermine distance from the center of mass, then it 
receives a penalty score of +2. However, if the hydrophilic residue is outside 
of this distance, then it receives a score of -1. The hydrophilic distance 
cutoffs are: 0.85 for aspartic acid; 0.8 for asparagine, glutamine, glutamic 

20 acid, lysine, proline, and serine; and 0.75 for arginine. If desired, an even 

more elaborate method may be used wherein the environment of the nearest 
sequence neighbors are taken into account or wherein a smooth sigmoid 
function replaces the strict distance cutoffs. 

In each of the above described methods, the energy of the candidate 
25 conformation is calculated by summing the energies of the individual 

residues. A predetermined number of conformations having the lowest 
energies is then selected for further refinement. 
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Developed by Huang et al at Stanford, the second method attempts to 
measure a hydrophobic fitness score by counting the number of hydrophobic 
contacts ("hydrophobic term") and the degree to which hydrophobic residues 
are buried ("burial term"). The procedure works with either an all Ca or 
5 with a Ca and pseudo Cp protein representation. Although the method will 

be described in terms of the latter, it may be readily adapted to work with an 
a Ca representation by substituting Ca-Ca distances instead of pseudo 
sidechain distances. 

The two terms, Hydrophobic Term and the Burial Term, are defined as 
10 follows: 

£ {H r H^ nce ) 

HydrophobicTerm = — 

n 

BurialTerm = — . 

n 



where for each hydrophobic sidechain / (as defined by Cp), H { is the number 
of neighboring hydrophobic sidechains within a specified distance from /, 
preferably about 7.3 A, H| Chance is the number of hydrophobic sidechain 
contacts which would be expected to occur strictly by chance, and B; is the 
15 total number of neighboring sidechains within a specified distance from 

residue /, preferably about 10 A. The hydrophobic fitness score, HF, is 
defined as, 

HF = (Hydrophobic Term) x (Burial Term). 



If desired, candidate conformations may be minimized as a function of HF, 
20 while preserving the overall tertiary conformation. In this procedure, each of 

the hydrophobic sidechains is directed towards the center of mass of the 
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protein while directing the hydrophilic residues away from the protein center. 
The center of mass of the hydrophobic residues is then calculated. 
Sidechains of a specified length, preferably about 3 A, are then placed on 
each hydrophobic residue and all of these sidechains are directed towards the 

5 hydrophobic center of mass. Hydrophobic sidechains not within 6 A of the 

hydrophobic center are directed away form the protein center of mass and the 
hydrophobic center of mass is recalculated. This step ensures that only those 
residues in a single, compact hydrophobic core are included. Sidechains for 
hydrophilic amino acids are directed away from the hydrophobic center of 

10 mass. At the end of the run, the candidate structures are ordered according to 

their HF score and a predetermined number or percentage of members of the 
ensemble having the best HF scores are then selected. 



Ensemble Refinement 

The last major step in the inventive ab initio modeling procedure is further 
1 5 refining the remaining candidate structures. In general, approximately 

between about 100 and about 1000 candidate structures are expected to 
remain in the ensemble at this point. 

High level refinement may be carried out using any of the known molecular 
mechanics minimization and molecular dynamics simulation methods. Full 

20 atom sidechains may be added to the backbone template structures in a 

computationally efficient manner using sidechain rotamer libraries. 
Illustrative examples of suitable rotamer libraries include but are not limited 
to those described by Ponder and Richards, J. Mol. Biol. 193: 775-791 (1987) 
and Dunbrack and Karplus, J. Mol. Biol. 230: 543-574 (1993). In preferred 

25 embodiments, the energy of solvation (the interaction of the structure with the 

solvent molecules in solution) is considered either explicitly or through 
known statistical mechanical formulations. Because the numbers will be 
sufficiently small at this stage, simulations may be carried out on all 



19 



WO 98/48270 



PCT/US98/08077 



remaining members of the ensemble to determine the minimum energy 
configuration. 



EXAMPLE 1 

Potential Solution to Levinthal's Paradox 
5 The size of the complete set of topologically distinct conformations for a n 

residue polypeptide was determined by generating ensembles of protein 
structures using the Generic Protein Direct Monte Carlo method. This 
method applies the CCBB direct Monte Carlo growth technique in 
conjunction with a generic energy function and peptide representation that 
10 treat all amino acid types identically. Because the energy expression is not 

dependent on amino acid sequence identify, a generic protein ("GP") 
ensemble contains a highly diverse set of self-avoiding protein conformations. 



In order to determine how large an ensemble must be to sample all self- 
avoiding folded topologies for a n residue protein, test sets were compiled of 

15 about 20 native proteins and protein domains for residue number, n = 20, 25, 

30, 35, 40, 45, 50, 55, 60, 65, 70, 80, and 100 residues. The size of the 
complete set of topologically distinct self-avoiding protein folds was 
estimated by determining how many GP conformations were required to find 
a near-native conformation for all each member of the test set (see also 

20 Figure 3). The result of this experiment is that the number required in the 

GP ensemble scales as approximately 1.2" which is substantially slower than 
the benchmark figure of 3 n . Figure 4 is a graphical illustration of this point. 



The Levinthal Paradox is founded on the assumption that there are 3 n 
conformation states for a n residue polypeptide. Because it was generally 
25 assumed that a polypeptide could not sample more than 10 13 conformations 

in one second, sampling all 3 100 or 10 48 states estimated for a 100 residue 
peptide was not believed possible. The paradox arises because despite the 
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staggering number of potential conformations, proteins nevertheless are able 
to find the global energy structure within millisecond timescales. 

Although a directed reaction pathway was suggested by Levinthal as a 
potential solution, recent experiments have shown that a restrictive single 

5 pathway model is not required. The key reduction in states might also be 

achieved by a multiple pathway mechanism, akin to a funnel. However, 
since even the highly ordered native state is only marginally more stable than 
an unfolded state, these models do not suggest a plausible explanation for 
how an early folding mechanism precludes the protein from sampling vast 

10 regions of unproductive conformation space. 

The GP folding studies suggest an alternative solution to the Paradox. For 
example, it is estimated that there are 3x1 0 7 topologically distinct 
conformations for a 100 residue peptide. Assuming an average sampling rate 
of one new state per 0.3 nanosecond, it is estimated that 3x1 0 7 states (the 

15 number of all folded conformations estimated for a 100 residue protein) can 

be sampled in approximately 10 milliseconds. Thus, proteins up to 160 
residues can fold by random sampling in one second. Since domains are 
typically no larger than 200 residues, it is possible for even multi-domain 
proteins to fold in sub-second time scales if the nucleation of locally static 

20 secondary structure within each domain reduces the net conformational 

freedom to the equivalent of a 160 residue unconstrained peptide. 

Arguments against such a random picture of folding are sometimes based on 
the experimental observation that fragments excised from proteins and kinetic 
folding intermediates possess native-like secondary structure. However, such 
25 biases are not required for small proteins or domains to fold. For example, 

86 amino acid reduced HIV-1 Tat (trans-activator) protein folds to a structure 
with a well-defined core, yet possesses no secondary structure or disulfide 
bonds. 
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The GP folding studies are believed to be consistent with recent experimental 
findings that protein folding is not generally confined to a single reaction 
pathway with readily identifiable intermediates. These results provide a 
plausible explanation of not only multiple folding pathways but also how 
5 even large proteins are able to fold to their unique three dimensional 

conformations in sub-second time scales. 

EXAMPLE 2 

Restricting the available conformational space to self-avoiding folded 
topologies provides the crucial reduction in the number of potential 

10 conformations in the ab initio folding problem. Figure 7 illustrates four 

proteins in which the inventive method of used: (a) 65-residue segment from 
the NMR determined structure of the proteolytic fragment from 
Bacteriorhodopsin (lbct); (b) 65 residue Porcine C5a (lc5a); (c) 80 residue 
fragment from acyl-coenzyme A binding protein (laca); and, (d) 80 residue 

15 segment from domain four of the N-terminal domain of 70 kD heatshock 

cognate protein (lhpm04). 

The superimposition of the native structure with the corresponding GP 
equivalent results in the following CRMS values: (a) 65-residue segment from 
the NMR determined structure of the proteolytic fragment from 

20 Bacteriorhodopsin (lbct): the GP structure has a CRMS fit of 5.78 A ; (b) 65 

residue Porcine C5a (lc5a): the GP structure has a CRMS fit of 5.40 A; (c) 
80 residue fragment from acyl-coenzyme A binding protein (laca): the GP 
structure has a CRMS fit of 6.12 A; and, (d) 80 residue segment from 
domain four of the N-terminal domain of 70 kD heatshock cognate protein 

25 (lhpm04): the GP structure has a CRMS fit of 6.14 A. 

Refining the corresponding GP structure results in a structure with the 
following CRMS values when compared with the native structure: (a) 65- 
residue segment from the NMR determined structure of the proteolytic 
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fragment from Bacteriorhodopsin (Ibct): refined structure has a CRMS of 
4.35 A; (b) 65 residue Porcine C5a (lc5a): the refined structure has a CRMS 
of 3.91 A; (c) 80 residue fragment from acyl-coenzyme A binding protein 
(laca): the refined structure has a CRMS of 4.97 A; and, (d) 80 residue 
5 segment from domain four of the N-terminal domain of 70 kD heatshock 

cognate protein (lhpm04): the refined structure has a CRMS of 4.22 A. 

As illustrated by Figure 7, a GP equivalent to the native folded topology is 
found using the inventive methods. However, it is evident that native local 
backbone configurations, such as secondary structure, are not well described. 

10 This is due to the crude energy function which does not take into account 

hydrogen bonding or other sequence specific interactions during the initial 
stages of the protocol. However, the lack of secondary structure during the 
ensemble generation process is not disconcerting since methods for including 
sequence specific information may be incorporated during the various stages 

15 of refinement. As also illustrated by the superimposition of the native with 

the refined structure in Figure 7, local refinement of GP folded topologies 
can result in conformations that are both globally and locally similar to native 
protein structures. 

As a result, coupled with various selection techniques, the use of the GP 
20 method significantly reduces the complexity of the ab initio folding problem 

to one of selecting and/or refining a set of structures by incorporating 
sequence specific information. 

EXPERIMENTAL METHODS 

Com pilation of native protein sets 
25 In order to increase the number of experimental structures for each peptide 

length w, we included longer structures truncated at the carbonyl terminus. 
For example, the test set for n = 45 consist of residues 1-45 from available 
protein structures length 45-49. In instances where the coordinate file 
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contained more than one set of coordinates for a given structure, the first set 
was used. 



The proteins are identified by either the protein identifiers used by the 
Brookhaven Databank or the CATH database. 

5 The 20, 25, and 30 residue proteins were constructed using the 20 parent 

proteins: laph, lcbh, lchl, lcld, lcta, Idee, ldfn, ldmc, lerp, lfct, lktx, 
lpnh, lppt, lsis, lsxm, 2achB, 2mhu, 2pgd03, 4cpal, 7znf. 

35mers (20): laml, lapo, lcbh, lchl, Idee, lerd, lerp, lica, lktx, lltsC, 
lolgA, lpcp02, lpptA4, lr0904, lsis, lsxm, 2achB, 2pgd03, 4cpal, 9wgaAl. 



10 40mers (21): 125d, laml, lapo, lbds, leptA, lerd, lfc2C, lhev, lhtrP, 

lhymA, lica, lltsC, lolg, lpcp02, lpoxA4, lr0904, Ires, lzaq, 2erl, 2pdd, 
9wgaAl. 

45mers (18): lahl, latx, lbia03, lcrn, lehs, lgln03, lgps, lhnr, lhuc, lilk02, 
liva, lloeB, lmylA, loma, lpdc, lshl, lyrnA, 2ech. 

15 50mers (21): lafp, lbal, lbrbl, legf, lenh, lfdx, lhcgB, llccA, lmbe, 

lncfA2, lptq, lraaB2, lsgp, ltfi, ltih, ltpm, 2atcB2, 2tgf, 3monB, 4sgbl, 
6insE. 



55mers (23): laaf, lamg02, lamy02, lbbo, lbpbOl, lctm02, ld66A, ldrs, 
lfca, lgfc, lhcc, llyaBl, lpdnC2, lpgb, lpnrAl, lprlC, lysaC, 2baa02, 
20 2mev4, 2reb02, 3aahB, 3ovo, 5pti. 

60mers (23): lata, lcsel, Idem, lfxrA, IgatA, lhdp, lhfh, ligd, lisuA, 
lmdyB, lnra, lntx, lpce, lpi2, lr69, lrhpA, lrpo, IscmA, lsso, ltrlA, 
2drpA, 2hntE, 4mt2. 
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65mers (24): lahdP, lbct, lbfmA, lbhb, lc5a, Ichc, lcis, IcopD, lctf, lhre, 
lhrt, lkbaA, lkst, lmjc, ImntA, InapA, locp, Ipse, IrtnA, lsap, Istu, 
IwapA, 2cro, 2sn3. 

70mers(21): lbbi, Ibod, lbpb03, lcksA, lftz, lfvl, lgbrA, IhcqA, lhma, 
5 Ihoe, lhpi, lhstA, Ilea, lneq, Into, loctC, losa02, lpkpOl, IspbP, lutg, 

2hsp. 



80mers (24): laba, laca, lapa02, lbgh, lctl, lcyg03, lcyi, Icyo, leptB, 
lgtrA3, lhip, lhpm02, lhpm04, Ihra, llab, Ipba, Ipht, lpoh, lpyaA, Itig, 
ltiv, 2dln01, 2fxb, 2gcr01. 

10 lOOmers (22): laj, lab2, lacx, Ibet, lcmbA, letc, lfd2, lfkb, lfus, lhks, 

lhrc, lltsD, lone, lpal, lput, lthx, Itlk, lyce, 2atcB, 2cdv, 2imn, 2pna. 



Generic Protein Structure Generation 

Using the Generic Protein Direct Monte Carlo method, a complete set of 
peptide backbone coordinates for a protein is constructed by consecutively 

15 adding residues in one of six possible conformation states to a single residue 

which is selected at random. In preferred embodiments, the first residue is at 
the center of the peptide sequence. The six possible conformations 
correspond to six pairs of $ \|/ dihedral angles are chosen because they 
represent the most energetically favorable and most populated regions of the 

20 Ramachandran plot. However, any number energetically favorable $ y pairs 

may be used. 

An illustrative example of six pairs of <|> y dihedral angles are: (-65, -42); 
(-123, 139); (-70, 138); (-87, -47); (77, 22); and (107, -174). Bond lengths 
and angles are fixed to standard values and the peptide torsion angle co is 
25 fixed to 180° for all residues. 
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During buildup procedures, the probability of selecting one of the available 
pairs of dihedral angles is governed by: 

-Ej/RT 



1 £ 



-EJRT 

e 1 



The addition energy, Ej, of a single residue is given by the summation of its 
pair-wise interaction energies with each residue in the peptide fragment. For 
all residues types, the energy of a residue pair is: 



E nb (R)=E c 
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wherein Rq is set at 5.5 angstroms for all residue types; E 0 is 0.15 kcal/mol; 
R is the distance between the a-carbon of each residue; and / and j are not 
adjacent neighbors in sequence. 

In preferred embodiments, energetically favorable addition steps are replicated 
10 by a factor m which is equal to 

inttCzi^z^)^.!^.^)] 

wherein z x - exp(-E/kt). Once a complete chain has been constructed, Zj 
values are calculated for each residue / in the completed chain. Initial values 
of <Zj> are based on values derived from 50 pre-completed polypeptide 
15 chains. To avoid replication factors that lead to the same basic fold, the 

value of (z/<z>) is set to 1 for the fixed central residue and at the N-terminal 
and C-terminal residues. 
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Once the enrichment factors are calculated for the just-completed chain, 
replication begins. At each residue / in the new polypeptide, if mj > 1, then 
the fragment up through the growth of residue / is replicated (m { - 1) times. 
For example, if m ; = 2, then one replication is performed since the original 
chain counts as one copy. Each of these replicated fragments is extended into 
complete polypeptide as described previously. These completed polypeptides 
are subject to replication at any residue in the freshly added growth where m { 
> 1 for the respective new polypeptide chains. In this manner, the growth of 
a single polypeptide chain leads to multiple generations of polypeptide chains 
that contain a central fragment from the parent polypeptide. 

In most preferred embodiments, a novel memory saving algorithm is used 
during the enrichment/replication stage. Once a complete polypeptide is 
constructed, values of m s are determined. The residue addition steps are 
backtracked in the opposite order in which the residues were added until a 
residue k is found for which m k > 1 . The protein fragment which 
incorporates all of the addition steps through the addition of residue k is 
replicated and an offspring polypeptide is created by adding to the newly- 
replicated fragment. Enrichment factors are calculated for the offspring chain 
and the value of m k for the parent chain is reduced by one since one of the 
replications that was to,take place at this residue has been completed. 

The backtracking through the parent chain continues until the value of m for 
each of the residues in the parent chain is 0 or 1. In this manner, each of the 
original values of m ; in the parent and every subsequent offspring is 
replicated (m- x - 1) times, while the coordinates of a single polypeptide is 
stored in memory. When this occurs, a new parent is constructed as 
previously described. 

In this manner, biased only van der Waals packing energy, a diverse 
ensemble of non-overlapping peptide chains with realistic peptide backbone 
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geometries can be rapidly generated. For example, more than 10 6 
conformations for a 50 residue protein was generated in one CPU day on a 
single processor Silicon Graphics Inc. R10000 workstation. 

Native Structure Search and Topology Verification 
5 The GP generated structure and the native structure were optimally . 

superimposed and the root mean squared deviation of the cc-carbons 
("CRMS") between the structures was calculated. Topological equivalence 
was determined by a rigorous, automated minimization procedure. Each a 
carbon in the candidate GP structure was tethered to the coordinates of the 
10 corresponding a carbon in the optimally superimposed native structure with a 

5 kcal/mol harmonic constraint energy. Minimization was performed on the 
constrained GP backbone using the Dreiding force field parameters which is 
described in Mayo et aL, J. Phys. Chem. 94: 8897 (1990) and which is 
incorporated herein by reference. 

If the GP structure and the native structure are topological^ equivalent, then 
during minimization, the GP structure should follow a direct trajectory toward 
the native structure. In other words, the GP structure should quickly 
minimize to the native coordinates. Topology differences are easily observed 
by the inability of the GP structure to minimize to the native coordinates 
since the force filed parameters do not permit covalent bond breakage in the 
peptide backbone or allow cooperative movements between non-local 
residues. The minimization trajectories demonstrate the conformational 
dynamics for a GP structure to assume a native fold. 

DISTANCE CONSTRAINED METHODS 

25 One implementation of the GP generation method which incorporates distance 

constraints is as follows. This modified method is used whenever an 
approximate distance between at least two residues is known. Because only 
structures which comply with this constraint are generated, the 
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implementation of the distance constraint is a powerful tool in reducing the 
number of candidate structures. 

In addition to protein structure prediction, the distance constrained GP 
method may be used with automated procedures for NMR peak sequence 
assignments. For example, using the distance constraints from a few 
unambiguously assigned peaks, GP conformations may be used to assist in 
the assignments of other peaks, which in turn can be used to further reduce 
the number of candidate conformations. This process can be reiterated until 
all observed peaks are assigned. A similar procedure may be developed for 
any other experimental or theoretical process which gives pair-wise or other 
structure information. Illustrative examples include but are not limited to 
spectroscopic labeling experiments, or x-ray intensity data wherein the 
Patterson function from Fourier transformation is used directly with the 
assignment of phases. 

It is preferred that a Ccc model is used during the initial stages with the Ca- 
Ca bond length, b, set to 3.8 A and the Ca-Ca-Ca angle, 9, set to 120°. As 
described previously, the Ca dihedrals, <)>, may either be determined using a 
Ccc version of a Ramachandran plot or a crude six state residue representation 
of <(. = to 0°, 60°, 120°, 180°, 240°, or 300°. 

Because a Ca representation is used, the coordinates for the first three 
residues are fixed from standard Ca bond length b and Ca-Ca-Ca angle 6. 
For example, if the coordinates of the first residue are set to 0,0,0, then the 
coordinates for the first three residue would be as follows: 

Cap (0.00, 0.00, 0.00). 

Ca i+1 = (3.80, 0.00, 0.00). 

Ca i+2 = (5.70, 3.29, 0.00). 
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Once the coordinates for the initial three contiguous residues are determined, 
the remaining coordinates are determined by the Ca dihedral <(>. As 
previously described, the probability of selecting one of the available pairs of 
dihedral angles during build up procedures is governed by: 




i 
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The addition energy, E h of a single residue is given by the summation of its 
pair-wise interaction energies with each residue in the peptide fragment. For 
all residues types, the energy of a residue pair is: 



{- 


12 


-2 


(r\ 6 


Joj 






W 



wherein Rq is set at 5.5 angstroms for all residue types; E 0 is 0.15 kcal/mol; 
R is the distance between the a-carbon of each residue; and / and j are not 
adjacent neighbors in sequence. 



The following is one implementation of the inventive method which takes 
one or more known distances into account as the structures are being 
generated. As it will be described in more detail below, the probability of 
selecting a residue addition step which would result in a structure which 
cannot meet the distance constraint is zero. 



Figure 8 is a representation of a peptide fragment in which residues M, /, 
i+l, z+2, /+3, and /+4 all line in the same plane. In other words, <|> i+1 , § i+ 
and <f> i+3 all are either 0° or 180°. For the purposes of illustration, a 
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cylindrical coordinate system is assumed with the z-axis travelling through 
the bond between residue M and residue /, and the z axis-origin at residue 
/-l. The radial axis, r, represents the perpendicular distance to the z-axis 
from any point in space. Given this coordinate system, it is possible to 
5 determine the maximum radial distance, r max , that residue (M)+n may be 

from the z-axis for a given value of z. The solid line connecting the possible 
in-plane coordinates for residue /+4 in Figure 8 shows the boundary of r max 
as a function of z for n = 5. 



Given this peptide representation, where 0 = 120° and b = 3.8 A, it is 
10 possible to write a general equation for r max in terms of z and n. When n is 

even, then for z > 3b/2, r max = r peak - (tan 30°)(z-(3b/2)) and for z < 3b/2, 
r max = r P eak + ( tan 30°)(z-(3b/2)). Here r peak = (n-l)(bsin60°), and z lies 
between {z min , z^} = {(-3b/4)(n-4), (3b/4)(n)}. When n is odd. then for z 
* b ' r max = r peak " 30°)(z-b) and for z < b, rmax = r peak + (tan 30°)(z-b). 
15 Consequently, r peak = (n-l)(bsin60°), and z lies between {z^, z max } = 

{(-b/4)(2+3(n-5)), (b/4)(4+3(n-l))}. By using these relationships, it is 
possible to determine the furthest position in (z, r) space that residue (M+n) 
may be placed from residues M and /. 



For example, consider a case in which a distance between residue j and k are 
20 known. For the purposes of illustration, assume that in the existing fragment, 

residue / has been fixed, residue k has yet to be added, and the residue to be 
added is residue /. Using the above equations, as residue / is to be placed, 
the furthest position in (z, r) space residue k can be placed from the position 
of residue M and residue / may be determined, given the sequence separation 
25 n = Hhl). 

For the purposes of illustration, assume that the distance between residue j 
and residue k is such that the maximum allowable that residue k may be 
placed from residue j is 6.58 A which is equivalent to the distance that may 
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be traversed in two residue addition steps (/.<? 2b(sin9/2) = 6.58 A). In other 
words, placing residue k within 6.58 A of residue j is equivalent to requiring 
that residue j lies in allowed (z, r) space for residue *+2. Hence, as the 
possible locations for residue / is examined, residue j must reside in allowed 
5 (z, r) space for the sequence separation n=£+2-(M) to also satisfy the 

distance constraint between residues k and j. If torsion $ { places residue / in 
a location outside of the allowed (z, r) space, then the distance constraint 
cannot possibly be satisfied, and the probability of selecting this torsion 
would be zero. 

10 As it can be seen from the above description, it is useful to associate a 

constraint bond order (o b ), which represents the number of residue addition 
steps required to span the constraint distance. Distances less than 3.8 A may 
be spanned by a single addition step of length b, hence the o b for this 
distance is 1. Distances up to 2b(sin 9/2) = 6.58 A may be spanned by two 

15 residue addition steps. In other words, for 3. 8 A < d <6.58 A, o b =2. Because 

most know distance constraints will be either derived from disulfide bond 
formation or NOESY NMR data, typical residue-residue distance constraints 
will be between about 3.8 A and about 7.4 A. Such distance constraints 
may be treated with o 5 =2, until the placement of the final residue in the 

20 constrained pair (i.e., placing residue k in the example above), where the 

actual distance constraint may be used in place of the above equations for z 
and r with n=3. 

Although the above example was illustrated with only one known distance, it 
is possible that a set of distance constraints affecting several residues in the 
25 protein sequence may be known. In the simplest case, a distance constraint is 

a first order constraint. This is the example previously discussed where 
residues j and fc are constrained by some distance and residue j is place prior 
to residue k and residue / is between j and k in sequence such that 
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i-j > k-i+o h -2. 

A second order constraint exists when two distance constraints are coupled 
such that both distance constraints cannot be satisfied by treating them 
independently as first order constraints. An example of a second order 
constraint arises, for example, in the following situation where the distance 
between residues 5 and 39 are constrained to be within o b =2 and the distance 
between residues 17 and 39 are also constrained to be within o b =2. The 
distance between residues 5 and 17 is considered a second order constraint 
with a 0^=4 since it is necessarily dependent on the distance constraints 
between residues 5 and 39 and the residues 17 and 39. In other words, in 
considering the possible placement of residue i, both the first and second 
order constraints must be satisfied. In the above example, for residues i-5 > 
17-/+o b -2 (i.e., residues 12, 13, 14, 15, 16, and 17), residue 5 must lie in 
allowed (z, r) space for n = 17+o b -(/-l), where o b =4. 

In summary, a list of inter-residue distance constraints (along with 
corresponding bond orders, o b , is inputted along with N, the total number of 
residues in the protein. The protein is then constructed residue by residue by 
moving to the right in sequence, beginning with the initial three N-terminal 
residues. For each step thereafter, the energy of each possible <j> angle is 
evaluated. If a distance constraint cannot be satisfied by a structure including 
the candidate torsion, then the probability of selecting this torsion angle is set 
to zero. If the distance constraint may be satisfied, then the probability of 
selecting this torsion is dependent on the Boltzmann energy as described 
previously. 

At a given residue addition step /, if no torsion satisfies the constraint 
conditions, then polypeptide is re-grown from residue i-4, in attempt to 
satisfy this constraint. The number of times this "backtracking" is performed 
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may be determined by the user. In a preferred implementation, one 
"backtrack" is allowed before the entire polypeptide is discarded and a new 
polypeptide grown from the initial three N-terminal residues. 

If desired, a "lookahead" strategy may also be used where the probability of 
5 selecting a torsion angle for residue / is biased by the placement of residue 

/+! . In this implementation, for a particular torsion candidate for residue /, 
potential torsions for residue z+1 are also explored to determine if constraints 
for residue i+l may be satisfied if the particular torsion for residue i is 
chosen. If this "lookahead" determines that there is no torsion for residue z+1 
10 which satisfies the constraints on residue z+1, then probability of selecting the 

particular candidate for residue z is set to zero. In preferred embodiments, 
favorable fragments are not replicated as in non-distance constrained 
generation methods. 



SAMPLING METHODS 

15 The sampling methods that are used are variations that are described by 

Sadanobu and Goddard, J. Chem. Phys. 106: 6702 (1997) which is 
incorporated in its entirety herein. Although the sampling methods described 
by Sadanobu and Goddard were developed for polymers, they are readily 
adapted to proteins, especially to calculations where a reduced representation 

20 of proteins is used. 



The Polymer Model 

A united atom model as described by Rychaert and Bellemans, Faraday 
Discuss. Chem. Soc. 66: 96 (1977) is used to represent the polymer or 
peptide backbone. If desired, the equations maybe adapted to include amino 
25 acid sidechains. In the model, each atom / in the chain is characterized by a 

Lennard- Jones 12-6 potential as in (1) 
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a 

w r v; 



12 



(i) 

lh _J ;th 



where e/k B = 72K, o = 0.3923 nm, and r } j is the distance between r 7 and / 
atoms. 



In addition, the torsion potential in (2) is included 



= £ -.(cos*/ 



"5 n=0 



(2a) 



10 



where 

a,, = 1.157, a, = 1.515, a 2 = -1.636, 
a 3 = -0.382, a 4 = 3.271, and a 5 = -3.927 
Here <)>,• is i' h torsion angle, and the geometry properties are taken as 
bond length : / = 0.153nm 
bond angle : 8 = 70.53° 
which corresponds to a carbon-carbon-carbon angle of 109.47°). 



(2b) 
(3) 



The total Hamiltonian has the form 

N i-4 



N 



HU* t n = £ E E u [^( «<mm + E *r(*i). 

i=5 y=i 



r=4 



(4) 



the results of which will be quoted in terms of a reduced temperature 
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k B T 

r = T' 

(5) 

The configurational partition function for the model polymer chain consisting 
of N carbon atoms is defined as 

Z N = / • • f exp[-p •#«♦,})] d$ 4 • • db N 

0 0 

(6) 



where P=1/£T. [Here <|>j is the torsion specifying the position of atom i with 
respect to atoms i-3, i-2, and i-1.] 

The Helmholtz free energy A, the potential energy E, and the entropy S, are 
given by 

A N = -p *lnZ N 

E N = /••/ H{i^))-exp[-^-H({^})]d^> 4 '-d^ N 
o o 

(E N -F N ) 

N j 

(7) 



Simple Sampling (SS) 

In the conventional direct Monte Carlo (DMC) method, polymer chains are 
generated by random step-by-step sampling of torsion angles. A complete N- 
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mer chain is constructed in sequence where the i th step samples the i th torsion 
to construct an i-mer chain. Then a new chain is set up and sampled again 
from scratch. This is referred to as simple sampling (SS). The partition 
function is evaluated by (8) 



Z N = iv; 1 *(27t)"- 3 £ exp[-p -J5f ({*,})] 



Here N c is total number of chains generated. 

The average value < f > of a physical property, f = f({$ t }), is 
calculated as in (9), 

</> = j _ 

i 



(8) 



(9) 



Independent Rotational Sampling (IRS) 

The sampling efficiency of SS-DMC is improved by applying rotationally 
biased sampling, in which torsions are sampled using a weighting function 
based on the Boltzmann factor of the torsion energy. This improvement to 
the simple sampling method is referred to as Independent Rotational 
Sampling (IRS). In the IRS method, the normalized torsion weighting 
function (TWF), W IRS , is defined as in (10) 
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"IRS 



(10a) 



where 



27t 



'IRS 



(10b) 



W*)=exp[-p£/(i))] 



(10c) 



Torsion angles are generated in accordance with (10a). The partition function 
for IRS after bias correction is evaluated by (1 1) 



N i-4 

-p-E E W 

i-5 j=l 



(11) 



W IRS need be calculated only once so that computational work involved in 
evaluating the partition function involves the Boltzmann factor for the 
nonbonding energy. With the IRS method, the use of W IRS effectively 
excludes high torsion energies throughout the MC sampling. Nevertheless, 
spatial overlaps between non-bonding atoms are inevitable, leading to high 



38 



WO 98/48270 



PCT/US98/08077 



configurational energies. In order to exclude these overlaps, information 
about the spatial environment in the vicinity of the growing chain end should 
be introduced into the TWF. The resulting form of the TWF, W*, is given 
by (12) 



^* (<»,; V • A_i> = 



g*(<fc4>4»->fy-i) 

Z*(4>4»-»<l> f -l) 



where 



2tc 



z * (* 4 , = fg * (4> f ; * 4 , ■,(() I ._ 1 W<|) J . 



i-4 



-P-E W 



(12a) 



(12b) 



(12c) 



The form of the partition function after bias correction becomes (13) 

N c ( N 

Z N = 'ZjRS ' £ 1 II Z * (*4»-»*,-i 0 



1 i=5 



(13) 



W* must be calculated at every step since it depends on all previous steps. 
The computation, time for this TWF is approximately proportional to the step 
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number, i; therefore, this sampling method becomes too expensive for 
systems containing a large number of atoms. 

Continuous Configurational Biased (CCB) Direct Monte Carlo 
Another novel sampling method, the Continuous Configurational Biased 
5 (CCB) direct Monte Carlo method, is described. In this variation, a- cutoff 

length for non-bonding interactions is introduced into the TWF calculation. 
On constructing the TWF for the i th torsion, a sphere of radius R c centered at 
the (i - l) th atom position is defined (See Figure 1). The length of R c should 
be taken larger than 1 + a, in order to ensure that all possible atomic overlaps 
10 are checked. Boltzmann factors for the non-bonding energy between I th atom 

and all other atoms inside the cut-off sphere are included in TWF, W CCB ,' as 
in (14) 

(14a) 



where 



2ir 



0 

(14b) 



(14c) 



and G(R) is the Heavyside step function 
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G(R) = 0ifR<0 
=1 if R> 0 



(15) 



10 



The computation time for W CCB is almost independent of i because the only 
non-bonding atoms considered are those in the local vicinity of a growing 
chain end. In addition, the list of atoms inside the cut-off circle for the i th 
atom is automatically available since all the necessary atomic distances were 
calculated to obtain the energy at the just previous step. The bias-corrected 
partition function has the form of (16), which includes the calculation of 
those non-bonding energies that did not appear in the TWF calculation of 
(14) 



AT 



Z N = K 1 'tins ' £ \ II Z CCB OlV'A-1.) 



N 



i=5 



exp 



i=5 y=l 



(16) 



The Continuous Configuration Boltzmann Biased fCC-BB) Method 
In the enrichment method for self-avoiding walks on a lattice, once a walk of 
i - 1 steps is successfully generated by the simple sampling method, this 
chain continues to be grown up to step i in different ways. In order to 
15 avoid bias, the enrichment factor m M is always fixed ahead of the Monte 

Carlo simulation. The total chain multiplicity, Mi, for step i is defined as 
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7=1 



(17) 



The chains obtained from a particular first monomer are not statistically 
independent. Hence, the set of all chains using the same seed as the first 
monomer are collected together and denoted as a cluster. Each cluster is then 
given the same weight. 

5 In the replication-deletion procedure (RDP) for a continuous space, chain 

enrichment is used to achieve a Boltzmann population for the collected 
chains. Here, Mj is determined at every step as statistically proportional to 
the ratio of the Boltzmann factor of step (i - 1) to that of step (i - 2), where 
mj is not integer. This leads to a high frequency of sampling chains with 

10 high energy (caused by nonbonding overlap) which are subsequently deleted 

in the course of sampling. The partition function is evaluated from the ratio 
of the total number of generated chains to the number of seeds. To avoid 
replicating chains too often, a scaling factor p is multiplied by Boltzmann 
factor. Since the suitable choice of scaling factors is unknown and strongly 

15 dependent on chain size and temperature, one determines them in trial and 

error manner prior to the Monte Carlo simulation. These scaling factors 
should be fixed ahead of Monte Carlo simulation. 

In the Continuous Configurational Biased Direct Monte Carlo (CCB-DMC) 
method, almost all high energy chains having non bonding overlaps are 
20 excluded. As a result, in comparison with replication-deletion procedures, 

chains are very seldom need to be deleted. However, the sampling 
distribution is not Boltzmann and low energy chains in the collection can be 
included with too high a contribution to the partition function. As a result, it 
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is preferred to extend the chain enrichment to control sampling so that all 
collected chains make a nearly equal contribution to the partition function. 

In this method, the multiplicity, Mj, is determined at every step as 
proportional to the ratio of the Boltzmann factor of a just-sampled chain to 
that of the running average value for the chain with same length. The 
partition function is explicitly calculated as the average of the weighting-bias- 
corrected Boltzmann factor divided by the chain multiplicity. In the 
Continuous Configuration Boltzmann Biased (CCBB) method, equation (16) 
is rewritten in terms of a sum over K clusters as 

K 

Z^K) = K- 1 •£ C^Q 

C=l 

(18) 



10 Denoting L n (C) as the total number of chains generated for cluster C cluster 

by using an arbitrary choice for the enrichment factor, the partition function 
(18) is calculated by (19) 



(19) 
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J=l 

(20) 



n=l 

(21) 



In CCBB, the chain multiplicity, Mj n (C), is determined as proportional to the 
ratio of<; n i .,(C)toZ i . 1 (C-l). 



<?"(C) 



Z M (C-1) 

(22) 



M t \C) = INTlO t n (Q] ifQ t \Q>l 
= 1 pQfrQ*! 



(23) 



This enrichment factor m"., is evaluated from the ratio of M n j to M".,. 

This procedure always keeps the chain multiplicity approximately 
5 proportional to the Boltzmann factor of the chain at the just-previous step. 
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P »(o = 

mUQ = INT[P"(C)] ifPi n (Q>l 

= 1 ifP-{C)<\ 

For i < 5 the Boltzmann population of the chain collection is completely 
satisfied in CCB. Therefore, chain multiplicity is set to unity as 



(24a) 



(24b) 



Mo 11 = Af/ 1 = «• = M 4 n = 1 

(24c) 



The choice of p is arbitrary. Too large a value of p could lead to an 
exploding number of samples of highly correlated configurations. Too small 
5 a value might lead to too few chains per cluster. For the polymer example 

considered here, p = 1 is used since it results in enriched chains having nearly 
equal contribution to the partition function. 

To obtain an initial guess for the partition function, Zj(0), a short non- 
Boltzmann Factor Biased (BFB) run is performed. A Boltzmann Factor 

10 Biased (BFB) method is an improved enrichment method, which introduces a 

configurational-dependent enrichment procedure with correct bias correction 
and automatic population control. (For this study, 200 chains were sampled 
prior to BFB sampling.) If the partition function used in the initial guess is 
too small, an extraordinarily large enrichment factor might occur for clusters 

15 after beginning the BFB sampling and ruin the MC sampling. To avoid this, 

an upper limit can be introduced for the enrichment factor (arbitrarily without 
any additional bias). For the example reported here, no special controls of 
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the enrichment factor were needed. Equation (24) gave automatic control of 
the number of chains generated by BFB sampling. The average number of 
generated chains per seed atom tended to become large at low temperature. 
It ranged from 3.5 at the highest temperature to 1 1 .2 at lowest one for N = 
5 400. 

CCB Sampling Procedure 

Prior to the chain sampling, the torsion energy was calculated for a fixed 
number of grid points. In this study, 200 equally separated grid points from 
0 to 27i)-W IRS were used. The normalized TWF for IRS, is then evaluated 
10 using numerical integration for z IRS as in (10b). This uses the auxiliary 

distribution Pi R s(<t>) in (25) 

*W4>) = / WW* W 

0 

(25) 



A local Cartesian reference frame is defined for each bond of the chain. The 



axial trans-formation matrix tj is 



cos0 sin6 0 

sinOcosfy -COSQsmfyi sincj),. 

sinGsind^. -cosGsiiK^. -cosfy. 



(26a) 



The first atom is set at origin and t 2 and t 3 are set as 
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'2 = 



1 0 0 
0 1 0 
0 0 1 





cos6 


sin0 


0 


h = 


sinG 


-cos0 


0 




0 


0 


1 



The position vector, R v of atom i is calculated as 



*=2 



*' = (1,0,0) 



(26b) 



(26c) 



Here, b is the bond vector and Tj is the transformation matrix from the local 
reference frame on the j th bond to the original reference frame. A random 
number uniformly distributed in the interval [0, 1), is drawn and the fourth 
torsion angle <|> 4 is obtained by requiring 

(27) 

For i > 4 after sampling the (i-l) th torsion, all non-bond distances are 
calculated to evaluate the energy and also to define an atom group {k};, 
whose elements consist of the neighbors of the (i-l) th atom 
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(28) 

The coordinates of all atoms in the list {k}j are transformed into the local 
reference frame on the (i - l) th bond by using the inverse matrix (T^)" 1 = 

1 i-1 

(29) 



In the local reference frame the coordinates of atom i for a trial move at the 
q* grid point <j)P q i is 



r ' = (/cos0, Zsin0 cos<j)f , Zsin0 sin(j)f) 



(30) 



which is independent of i if the same type of grid is used for all torsions. 
g CCB ((|»j) is evaluated as 



8 C cb($> 4> 4 , ,4>,_i) = W*?) * ex P 



(31) 
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Then z CCB and W CCB are evaluated by using the above expression Of g CCB . 
The auxiliary distribution P CCB is obtained by (32) 

p ca,(W = / »w+;V-.+i-iW 



(32) 



BFB Sampling Procedure 

In the BFB method, the number of chains that will be generated in a cluster 
5 cannot be foreseen. Thus, the amount of memory required to store the 

information for growing branches of the chains in a cluster cannot be 
predetermined a priori. As a result, a memory-saving algorithm is used, in 
which just one chain is grown at a time. 

The complete construction of one chain at time is as follows. For cluster k + 
10 1, the i index starts at i = 4 and increases to i = N. For each such i, each i' 

from i to N is considered. First, the enrichment factor is determined 
using (22) and evaluating the running average of the partition function, Z r (k 
+ 1). For each step i\ the chain generation counter F r is then defined and 
set to F r = m r . After calculating Fj from i' = i to i" = N, the calculation 
15 then starts at i" = N and work from N back to i. Each Fj«, is checked to 

determine if it is greater than unity. When F r > 1 for i" > i - 1, the i th 
torsion is sampled once more and F;„ is reduced by unity. A new chain is 
grown from step i" to N and a new value of m r . is evaluated for each step. 
The same procedure is repeated until there IS nO Tim larger than unity. At this 
20 point the (k + l)th cluster is completed, and the (k + 2)th cluster can be 

started. The flow chart of the method is illustrated by Figure 9. 

It is to be understood that while the invention has been described above in 
conjunction with preferred embodiments, the description and examples are 
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intended to illustrate and not limit the scope of the invention, which is 
defined by the scope of the appended claims. 
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What is claimed is: 

1 . A method for ab initio structure prediction for a n residue protein 
backbone, comprising: 

selecting a finite set of torsion angles and 
5 generating an ensemble of conformations for the protein backbone 

wherein the ensemble represents an exhaustive enumeration of self-avoiding 
backbone conformations for the selected set of torsion angles. 

2. The method as in claim 1 wherein the torsion angles are <f> \\t dihedral 
angles. 

10 3. The method as in claim 3 wherein the finite set of (|> iy dihedral angles 

include: (-65, -42); (-123, 139); (-70, 138); (-87, -47); (77, 22); and (107, 
-174). 

4. The method as in claim 1 wherein the torsion angles are Ca;-Ca i+1 - 
Ca i+3 -Ca i+4 dihedral angles. 

15 5. The method as in claim 1 further comprising 

calculating the radius of gyration and eliminating those conformations 

having values outside of the range defined by 8j # and S 2 ' 

wherein is 

*„u> r ) = -l-26 + 2.79n r I/3 , 

rif is the number of amino acid residues in the protein, 5j is between about 
20 0.9 and 1.0, and 8 2 is between about 1.4 and 1.5. 
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6. The method as in claim 1 further comprising 

reducing the number of conformations in the ensemble by considering 
sequence dependent distance constraints selected from the group consisting of 
disulfide bond requirements, metal coordination site requirements, and NMR 
5 derived NOE constraints. 



7. A method for determining the three dimensional backbone structure of 
a protein having an amino acid sequence of R r R 2 ...R n . 1 -R n , wherein n is the 
total number of amino acid residues in the sequence, comprising: 

selecting a finite set of <|> vj/ dihedral angles and 
10 generating an ensemble of conformations for the protein backbone 

wherein the ensemble represents an exhaustive enumeration of self-avoiding 
backbone conformations for the selected set of dihedral angles. 

8. The method as in claim 7 wherein the ensemble generation step 
includes: 

15 (a) selecting a residue / within the amino acid sequence; 

(b) initiating a growing fragment by determining the three dimensional 
backbone coordinates for residue /; 

(c) selecting a residue position that is adjacent to a residue whose 
three dimensional backbone coordinates have been previously determined; 

20 (d) picking a § \\f dihedral pair from among the finite set using a 

Metropolis-based sampling method; 

(e) determining the three dimensional backbone coordinates for the 
selected residue; 

(f) growing the fragment by adding the selected residue to the 
25 fragment; 

(g) calculating an addition energy wherein the addition energy is the 
summation of the pair wise interaction energies of the residues in the 
fragment; 
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(h) accepting or rejecting the backbone coordinates for the selected 
residue based upon an evaluation of the addition energy; and, 

(i) repeating steps (c)-(h) until the three dimensional backbone 
coordinates have been determined for each protein residue. 

9. The method as in claim 7 wherein the ensemble generation step 
includes: 

(a) selecting a residue / within the amino acid sequence; 

(b) initiating a fragment corresponding to residue / by determining the 
three dimensional backbone coordinates for residue /; 

(c) selecting an existing fragment and selecting a residue position for 
adding a residue on to the existing fragment; 

(d) picking a <J> i|/ dihedral pair from among the finite set using a 
Metropolis-based sampling method; 

(e) determining the three dimensional backbone coordinates for the 
selected residue; 

(f) growing the existing fragment by adding the selected residue 
thereon; 

(g) calculating an addition energy wherein the addition energy is the 
summation of the pair wise interaction energies of the residues in the 
fragment; 

(h) accepting or rejecting the backbone coordinates for the selected 
residue based upon an evaluation of the addition energy; 

(i) enriching the number of copies of the fragment which includes the 
coordinates of the selected residue; 

G) repeating steps (c)-(i) until every existing fragment has grown to n 
residues. 
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10. The method as in claim 9 wherein the addition energy is by 



(A 



12 



*0 



V ^/ J 



wherein Rq is between about 5 and 6 angstroms, and E 0 is between about 0.1 
kcal/mol and 0.2 kcal/mol for all residue types. 



10 



11. The method as in claim 9 wherein the number of copies made in the 
enrichment step is equal to 

wherein x is the total number of residues in the fragment which includes the 
selected residue; = exp(-E x /kt) is the Boltzmann factor for the fragment, 
<z x > is the accumulated average Boltzmann factor for all x length fragments; 
z x . 1 is the Boltzmann factor for the fragment without the addition of the 
selected residue; and <z x . 1 > is the accumulated average Boltzmann factor for 
all x-1 length fragments. 



15 



20 



12. The method as in claim 7 further comprising: 

(a) calculating a center of mass for conformation; 

(b) determining whether each residue is hydrophobic or non- 
hydrophobic; 

(c) assigning a penalty for each hydrophobic residue found outside a 
predetermined distance from the center of mass; and, 

(d) assigning a penalty for each hydrophilic residue found inside the 
predetermined distance from the center of mass. 



13. The method as in claim 7 further comprising: 

(a) calculating a center of mass for conformation; 
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(b) determining whether each residue is hydrophobic or non- 
hydrophobic; 

(c) assessing a penalty for each hydrophobic residue between about 
0.95 and 1.3 of the center of mass in which another residue is 

5 present between the hydrophobic residue and the center of mass, wherein 

= -1.26+ 2.79^, 



and iij. is the number of amino acid residues in the protein; and, 

(d) reducing the number of conformations in the ensemble by 
selecting conformations having the least number of penalties. 

10 14. The method as in claim 7 further comprising: 

reducing the number of conformations in the ensemble by calculating 
an energy for each conformation by: 

(a) calculating a center of mass for conformation; 

(b) calculating R,^ using the formula 

/UK) 3 " 1-26 + 2.79n r "\ 



15 wherein is the number of amino acid residues in the protein; 

(c) determining whether each residue is hydrophobic or non- 

hydrophobic; 

(d) assessing a first negative energy value for each 
hydrophobic residue within 0.95 R min of the center of mass; 

20 (c) assessing a second negative energy value for each 

hydrophobic residue between about 0.95 R min and 1 .3 R min of the center of 
mass in which another residue is not present between the hydrohphobic 
residue and the center of mass, wherein the first negative value is more 
negative than the second negative value; 
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(d) assessing a positive energy value for each hydrophobic 
residue between about 0.95 and 1.3 of the center of mass in which 
another residue is present between the hydrophobic residue and the center of 
mass; and, 

(e) adding the! values of the individual residues for selecting 
the lowest energy conformations. 



15. The method as in claim 7 further comprising: 

reducing the number of conformations in the ensemble by calculating 
an energy for each conformation by: 

(a) calculating a center of mass for conformation; 

(b) calculating using the formula 

^ nin (« r ) = -1.26 + 2.79« r " 3 ) 



wherein is the number of amino acid residues in the protein; 

(c) determining whether each residue is hydrophobic or non- 

hydrophobic; 

(d) assessing a first negative energy value for each 
hydrophobic residue within 0.95 R min of the center of mass; 

(c) assessing a second negative energy value for each 
hydrophobic residue between about 0.95 R min and 1.3 R min of the center of 
mass in which another residue is not present between the hydrohphobic 
residue and the center of mass, wherein the first negative value is more 
negative than the second negative value; 

(d) assessing a positive energy value for each hydrophobic 
residue between about 0.95 R min and 1.3 R min of the center of mass in which 
another residue is present between the hydrohphobic residue and the center of 
mass; 

(e) assessing a negative energy value for each hydrophilic 
residue that is surface accessible; 
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(f) adding the values of the individual residues for selecting the 
lowest energy conformations. 

16. A method for determining an alpha carbon backbone structure of a 
protein having an amino acid sequence of R r R 2 ...R n _ r R T1 , wherein n is the 
total number of amino acid residues in the sequence and having at least one 
distance constraint between two non-adjacent residues in sequence, 
comprising: 

selecting a finite set of Ca dihedral angles and 
generating an ensemble of conformations for the protein backbone 
wherein the ensemble represents an exhaustive enumeration of self-avoiding 
backbone conformations for the selected set of Ca dihedral angles. 

17. The method as in claim 16 wherein the ensemble generation step 
includes determining the coordinates of the first three N-terminal residues by 
setting the Ca-Ca bond length equal to 3.8 A and the Cot-Ca-Cot bond angle 
equal to 120°. 

18. The method as in claim 17 wherein the ensemble generation step 
further includes: 

(a) selecting a residue / in the sequence wherein residue /' is the next 
amino acid in sequence with respect to the previously placed residue; 

(b) selecting a Ca dihedral pair from among the finite set using a 
Metropolis-based sampling method; 

(c) determining the three dimensional backbone coordinates for the 
selected residue; 

(d) determining whether the placing the selected residue would allow 
the structure to satisfy the distance constraint; and, 

(e) rejecting the backbone coordinates if the structure could satisfy the 
distance constraint or accepting the backbone coordinates if the structure 
could not satisfy the distance constraint. 
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19. The method as in claim 18 wherein the backbone coordinates are 
rejected in step (e) of claim 17 and the ensemble generation step further 
includes repeating steps (b) through (e) until the backbone coordinates of the 
selected residue are accepted. 

20. The method as in claim 18 wherein the backbone coordinates are 
accepted in step (e) of claim 17 and the ensemble generation step further 
includes repeating steps (a) through (e). 

21. The method as in claim 16 wherein the generated ensemble is used 
with a procedure for determining NMR peak sequence assignments. 
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